You can download the raw source code for these lecture notes here.
We will be using the still-developing tidymodels framework
this week to integrate with the different machine-learning libraries
in a consistent manner. You can install this package from the usual
RStudio Tools menu. All of the other tools that are strictly necessary
for clustering are available in base R. For full flexibility, however,
the ggdendro and heatmaply packages are
recommended. If you want to explore further possibilities, look at the
cluster and protoclust packages.
As you work through the breakout sessions, you will occasionally get
error messages asking you to install other packages, too. Install
whatever R asks for from the Tools menu, and then try running the chunk
again. Two helper functions are also included here:
get_conf_mat() and get_pr().
The missing files from the class corpus have now been added, as well
as labels for which tracks were AI-generated and which not. You should
re-download the class corpus, class corpus features, and
compmus2025.csv from Canvas.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.1     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(ggdendro)
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
##
## Loading required package: viridis
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
##
## The following object is masked from 'package:scales':
##
## viridis_pal
##
##
## ======================
## Welcome to heatmaply version 1.5.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags:
## https://stackoverflow.com/questions/tagged/heatmaply
## ======================
source("compmus.R")
get_conf_mat <- function(fit) {
  # Pool the cross-validated predictions from a resampled fit and tabulate
  # them against the ground truth as a single confusion matrix.
  outcome <- .get_tune_outcome_names(fit)
  fit |>
    collect_predictions() |>
    conf_mat(truth = outcome, estimate = .pred_class)
}
get_pr <- function(fit) {
  # Compute per-class precision and recall from the resampled confusion matrix.
  fit |>
    conf_mat_resampled() |>
    group_by(Prediction) |> mutate(precision = Freq / sum(Freq)) |>
    group_by(Truth) |> mutate(recall = Freq / sum(Freq)) |>
    ungroup() |> filter(Prediction == Truth) |>
    select(class = Prediction, precision, recall)
}
For this work, it is helpful to load the class corpus features into a separate variable.
compmus2025 <- read_csv("compmus2025.csv")
## Rows: 108 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): filename
## dbl (5): arousal, danceability, instrumentalness, tempo, valence
## lgl (1): ai
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In the tidyverse approach, we can preprocess data with a
recipe specifying what we are predicting and which variables
we think might be useful for that prediction. For most projects, the
track name will be the best choice of label (although feel free to experiment
with others). The code below uses str_trunc to clip the
track name to a maximum of 20 characters, again in order to improve
readability. The column_to_rownames command is ugly but
necessary for the plot labels to appear correctly.
Then we use step functions to do any data cleaning. Standardising
variables with step_center (making the mean 0) and
step_scale (making the standard deviation 1) is the most
common approach, but as we discussed this week, step_range
is sometimes a better alternative: it squashes or stretches every feature
so that it ranges from 0 to 1. It's wise to try both.
cluster_juice <-
  recipe(
    filename ~
      arousal +
      danceability +
      instrumentalness +
      tempo +
      valence,
    data = compmus2025
  ) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors()) |>
  # step_range(all_predictors()) |>
  prep(compmus2025) |>
  juice() |>
  # Clip long track names so the dendrogram labels stay readable.
  mutate(filename = str_trunc(filename, 20)) |>
  column_to_rownames("filename")
When using step_center and step_scale,
Euclidean distance is the usual choice. When using step_range,
Manhattan distance is also a good option: this combination is
known as Gower's distance and has a long history in
clustering.
After you have this section of the notebook working with Euclidean distance, try modifying it to use Gower's distance.
compmus_dist <- dist(cluster_juice, method = "euclidean")
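The modification itself is one line. A minimal sketch, assuming cluster_juice has been re-juiced with step_range so that every feature lies in [0, 1] (compmus_dist_gower is a hypothetical name):
# Manhattan distance on range-scaled features yields Gower's distance.
compmus_dist_gower <- dist(cluster_juice, method = "manhattan")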
There are three primary types of linkage: single, average,
and complete. Usually average or complete gives the best results. We can
use the ggdendrogram function to make a more standardised
plot of the results.
compmus_dist |>
  hclust(method = "single") |> # Try single, average, and complete.
  dendro_data() |>
  ggdendrogram()
Try all three of these linkages. Which one looks the best? Which one sounds the best (when you listen to the tracks on Spotify)? Can you guess which features are separating the clusters?
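To answer that last question, it can help to cut the tree into a fixed number of clusters and inspect which tracks end up together. A minimal sketch, assuming that four clusters is a sensible number for your corpus:
# Cut the average-linkage tree into 4 clusters and list the members of each.
compmus_dist |>
  hclust(method = "average") |>
  cutree(k = 4) |>
  enframe(name = "filename", value = "cluster") |>
  arrange(cluster)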
Especially for storyboards, it can be helpful to visualise
hierarchical clusterings along with heatmaps of feature values. We can
do that with heatmaply. Although the interactive heatmaps
are flashy, think carefully when deciding whether this representation is
more helpful for your storyboard than the simpler dendrograms above.
heatmaply(
  cluster_juice,
  hclustfun = hclust,
  hclust_method = "average", # Change for single, average, or complete linkage.
  dist_method = "euclidean"
)
Can you determine from the heatmap which features seem to be the most and least useful for the clustering? Try modifying the recipe to find the most effective combination of features.
In order to demonstrate some of the principles of classification, we will try to identify features that distinguish the AI-generated from the non-AI-generated tracks in the class corpus. For a full analysis, we would need to delve deeper, but let's start with the features we used in the first week.
After you have this section of the notebook working, try using other combinations of features.
We need to filter out the tracks from students who did not specify whether their tracks were AI-generated.
compmus2025_filtered <-
  compmus2025 |>
  filter(!is.na(ai)) |>
  mutate(ai = factor(if_else(ai, "AI", "Non-AI")))
As you think about this lab session (and your portfolio), think about the four kinds of validity that Sturm and Wiggins discussed in our reading for last week, and ask whether these projects have them.
Remember that in the tidyverse approach, we can
preprocess data with a recipe. In this case, instead of a
label for making the cluster plots readable, we use the label for the
class that we want to predict.
classification_recipe <-
  recipe(
    ai ~
      arousal +
      danceability +
      instrumentalness +
      tempo +
      valence,
    data = compmus2025_filtered
  ) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors()) # Converts to z-scores.
  # step_range(all_predictors()) # Sets range to [0, 1].
The vfold_cv function sets up cross-validation. We will
use 5-fold cross-validation here in the interest of speed, but 10-fold
cross-validation is more typical.
compmus_cv <- compmus2025_filtered |> vfold_cv(5)
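Because the class corpus is small, it can also be worth stratifying the folds so that each one contains roughly the same proportion of AI and non-AI tracks. A minimal sketch of that variant (compmus_cv_strat is a hypothetical name):
# Stratified 5-fold cross-validation preserves the AI/non-AI balance per fold.
compmus_cv_strat <- compmus2025_filtered |> vfold_cv(5, strata = ai)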
Your optional DataCamp tutorials this week introduced four classical
algorithms for classification: \(k\)-nearest neighbour, naive Bayes,
logistic regression, and decision trees. Other than naive Bayes, all of
them can be implemented more simply in tidymodels.
A \(k\)-nearest neighbour classifier often works just fine with only one neighbour. It is very sensitive to the choice of features, however. Let's check the performance as a baseline.
knn_model <-
  nearest_neighbor(neighbors = 1) |>
  set_mode("classification") |>
  set_engine("kknn")
classification_knn <-
  workflow() |>
  add_recipe(classification_recipe) |>
  add_model(knn_model) |>
  fit_resamples(compmus_cv, control = control_resamples(save_pred = TRUE))
classification_knn |> get_conf_mat()
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
## # Was:
## data %>% select(outcome)
##
## # Now:
## data %>% select(all_of(outcome))
##
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
##           Truth
## Prediction AI Non-AI
##     AI     29     17
##     Non-AI 20     24
These confusion matrices can be autoplotted in two forms: a mosaic or a heatmap.
classification_knn |> get_conf_mat() |> autoplot(type = "mosaic")
classification_knn |> get_conf_mat() |> autoplot(type = "heatmap")
We can also compute precision and recall for each class.
classification_knn |> get_pr()
## # A tibble: 2 × 3
##   class  precision recall
##   <fct>      <dbl>  <dbl>
## 1 AI         0.630  0.592
## 2 Non-AI     0.545  0.585
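If one neighbour seems too brittle, tidymodels can also tune the number of neighbours over the same cross-validation folds. A minimal sketch, assuming a small grid of odd values for \(k\) (knn_tune_model and knn_tune_results are hypothetical names):
# Mark the number of neighbours as a tuning parameter.
knn_tune_model <-
  nearest_neighbor(neighbors = tune()) |>
  set_mode("classification") |>
  set_engine("kknn")
# Evaluate each candidate value of k on the cross-validation folds.
knn_tune_results <-
  workflow() |>
  add_recipe(classification_recipe) |>
  add_model(knn_tune_model) |>
  tune_grid(compmus_cv, grid = tibble(neighbors = c(1, 3, 5, 7, 9)))
knn_tune_results |> collect_metrics()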
Random forests are a more powerful variant of the decision-tree algorithm you learned about on DataCamp. Although no single classifier works best for all problems, in practice, random forests are among the best-performing off-the-shelf algorithms for many real-world use cases.
forest_model <-
  rand_forest() |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")
indie_forest <-
  workflow() |>
  add_recipe(classification_recipe) |>
  add_model(forest_model) |>
  fit_resamples(
    compmus_cv,
    control = control_resamples(save_pred = TRUE)
  )
indie_forest |> get_pr()
## # A tibble: 2 × 3
##   class  precision recall
##   <fct>      <dbl>  <dbl>
## 1 AI         0.698  0.612
## 2 Non-AI     0.596  0.683
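Precision and recall are not the only options: fit_resamples also records standard metrics such as accuracy and ROC AUC, which we can retrieve directly.
# Summarise the default classification metrics across the folds.
indie_forest |> collect_metrics()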
Random forests also give us a ranking of feature importance,
which is a measure of how useful each feature in the recipe was for
distinguishing the ground-truth classes. Because we used the
ranger engine, we can extract these values with
ranger::importance() and plot them with ggplot2, as below.
Look at which of our five features rank highest. Note
that because random forests are indeed random, the accuracy and feature
rankings will vary (slightly) every time you re-run the code.
workflow() |>
  add_recipe(classification_recipe) |>
  add_model(forest_model) |>
  fit(compmus2025_filtered) |>
  pluck("fit", "fit", "fit") |>
  ranger::importance() |>
  enframe() |>
  mutate(name = fct_reorder(name, value)) |>
  ggplot(aes(name, value)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(x = NULL, y = "Importance")
Armed with this feature ranking, perhaps we can make a better plot. The scatterplot below sets the AI-generated and human-made tracks against valence, arousal, and tempo; look for regions where the two classes separate.
compmus2025_filtered |>
  ggplot(aes(x = valence, y = arousal, colour = ai, size = tempo)) +
  geom_point(alpha = 0.8) +
  scale_color_viridis_d() +
  labs(
    x = "Valence",
    y = "Arousal",
    size = "Tempo",
    colour = "AI"
  )
Can you get a clearer visualisation by using a different set of the top features from the random forest?
When you are happy with your visualisation, the
ggplotly() trick from the first week can work well for these
plots.
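A minimal sketch of that trick (ai_plot is a hypothetical name; heatmaply already loads plotly, so ggplotly is available):
# Save the static plot and wrap it in an interactive plotly widget.
ai_plot <-
  compmus2025_filtered |>
  ggplot(aes(x = valence, y = arousal, colour = ai, size = tempo)) +
  geom_point(alpha = 0.8) +
  scale_color_viridis_d() +
  labs(x = "Valence", y = "Arousal", size = "Tempo", colour = "AI")
ggplotly(ai_plot)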